Abstract

Information integration applications, such as mediators or 
mashups, that require access to information resources cur- 
rently rely on users manually discovering and integrating 
them in the application. Manual resource discovery is a slow 
process, requiring the user to sift through results obtained 
via keyword-based search. Although search methods have
advanced to include evidence from a document's contents, its
metadata and the contents and link structure of the referring
pages, they still do not adequately cover information sources
(often called "the hidden Web") that dynamically generate
documents in response to a query. The recently popu-
lar social bookmarking sites, which allow users to annotate 
and share metadata about various information sources, pro- 
vide rich evidence for resource discovery. In this paper, we 
describe a probabilistic model of the user annotation process 
in the social bookmarking system del.icio.us. We then use the
model to automatically find resources relevant to a partic- 
ular information domain. Our experimental results on data 
obtained from del.icio.us show this approach as a promising 
method for helping automate the resource discovery task. 



Introduction 

As the Web matures, an increasing number of dynamic 
information sources and services come online. Unlike 
static Web pages, these resources generate their contents 
dynamically in response to a query. They can be HTML-based,
with the site searched via an HTML form, or they can be Web
services. Proliferation of such resources has led to a number
of novel applications, including Web-based mashups, such
as Google Maps and Yahoo Pipes, information integration
applications (Thakkar, Ambite, & Knoblock 2005)
and intelligent office assistants
(Lerman, Plangprasopchok, & Knoblock 2007) that compose
information resources within the tasks they perform. In
all these applications, however, the user must discover and 
model the relevant resources. Manual resource discovery 
is a very time consuming and laborious process. The 
user usually queries a Web search engine with appropriate 
keywords and additional parameters (e.g., asking for .kml or 
.wsdl files), and then must examine every resource returned 
by the search engine to evaluate whether it has the desired
functionality. Often, it is desirable to have not one but
several resources with an equivalent functionality to ensure 
robustness of information integration applications in the 
face of resource failure. Identifying several equivalent 
resources makes manual resource discovery even more time 
consuming. 

The majority of the research in this area of information
integration has focused on automating the modeling of
resources, i.e., understanding the semantics of the data they
use (Heß & Kushmerick 2003;
Lerman, Plangprasopchok, & Knoblock 2006) and the
functionality they provide (Carman & Knoblock 2007). In
comparison, the resource discovery problem has received
much less attention. Note that traditional search engines,
which index resources by their contents (the words or
terms they contain), are not likely to be useful in this
domain, since the content is dynamically generated. At
best, they rely on the metadata supplied by the resource
authors or the anchor text in the pages that link to the
resource. Woogle (Dong et al. 2004) is one of the few
search engines to index Web services based on the syntactic
metadata provided in the WSDL files. It allows a user to
search for services with similar functionality or that accept
the same inputs as another service.

Recently, a new generation of Web sites has rapidly 
gained popularity. Dubbed "social media," these sites al- 
low users to share documents, including bookmarks, photos, 
or videos, and to tag the content with free-form keywords. 
While the initial purpose of tagging was to help users or- 
ganize and manage their own documents, it has since been 
proposed that collective tagging of common documents can 
be used to organize information via an informal classification
system dubbed a "folksonomy" (Mathes 2004). Consider,
for example, http://geocoder.us, a geocoding service
that takes an address as input and returns its latitude and
longitude. On the social bookmarking site del.icio.us, this
resource has been tagged by more than 1,000 people. The
most common tags associated by users with this resource are
"map," "geocoding," "gps," "address," "latitude," and "longitude."
This example suggests that although there is generally
no controlled vocabulary in a social annotation system,
tags can be used to categorize resources by their functionality.

We claim that social tagging can be used for information 
resource discovery. We explore three probabilistic gener- 
ative models that can be used to describe the tagging pro- 
cess on del.icio.us. The first model is the probabilistic 
Latent Semantic model (Hofmann 1999), which ignores
individual users by integrating bookmarking behaviors from
all users. The second model, the three-way aspect model,
was proposed (Wu, Zhang, & Yu 2006) to model del.icio.us
users' annotations. The model assumes that there exists a
global conceptual space that generates the observed values
for users, resources and tags independently. We propose
an alternative third model, motivated by the Author-Topic
model (Rosen-Zvi et al. 2004), which maintains that latent
topics that are of interest to the author generate the words in
the documents. Since a single resource on del.icio.us could
be tagged differently by different users, we separate "topics",
as defined in the Author-Topic model, into "(user) interests"
and "(resource) topics". Together, user interests and
resource topics generate tags for one resource. In order to
use the models for resource discovery, we describe each re- 
source by a topic distribution and then compare this distri- 
bution with that of all other resources in order to identify 
relevant resources. 

The paper is organized as follows. In the next section, 
we describe how tagging data is used in resource discovery. 
Subsequently we present the probabilistic model we have 
developed to aid in the resource discovery task. The section 
also describes two earlier related models. We then compare
the performance of the three models on the datasets obtained 
from del.icio.us. We review prior work and finally present 
conclusions and future research directions. 

Problem Definition 

Suppose a user needs to find resources that provide some 
functionality: e.g., a service that returns current weather 
conditions, or latitude and longitude of a given address. In 
order to improve robustness and data coverage of an appli- 
cation, we often want more than one resource with the nec- 
essary functionality. In this paper, for simplicity, we assume 
that the user provides an example resource, that we call a 
seed, and wants to find more resources with the same func- 
tionality. By "same" we mean a resource that will accept the 
same input data types as the seed, and will return the same 
data types as the seed after applying the same operation to 
them. Note that we could have a more stringent requirement 
that the resource return the same data as the seed for the 
same input, but we don't want to exclude resources that may 
have different coverage. 

We claim that users in a social bookmarking system such 
as del.icio.us annotate resources according to their function- 
ality or topic (category). Although del.icio.us and similar 
systems provide different means for users to annotate documents,
such as notes and tags, in this paper we focus on
utilizing the tags only. Thus, the variables in our model are
resources R, users U and tags T. A bookmark i of resource
r by user u can be formalized as a tuple $(r, u, \{t_1, t_2, \ldots\})_i$,
which can be further broken down into co-occurrences of
triples of a resource, a user and a tag: $(r, u, t)$.
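
The decomposition of bookmarks into triples can be sketched as follows. This is only an illustration under assumptions: the function name, the user names and the in-memory list representation are hypothetical and do not come from the crawler described in this paper.

```python
from collections import Counter


def bookmarks_to_triples(bookmarks):
    """Expand each bookmark (resource, user, tags) into (r, u, t) triples."""
    return [(r, u, t) for r, u, tags in bookmarks for t in tags]


# Hypothetical toy bookmarks; real data would come from del.icio.us pages.
bookmarks = [
    ("geocoder.us", "alice", ["map", "geocoding"]),
    ("geocoder.us", "bob", ["geocoding", "gps"]),
]

triples = bookmarks_to_triples(bookmarks)

# n(r, u, t): how often each resource-user-tag triple occurs.
n_rut = Counter(triples)

# n(r, t): resource-tag counts obtained by summing over users,
# which is the aggregation a user-agnostic model would use.
n_rt = Counter((r, t) for r, _, t in triples)
```

Keeping the counts in `Counter` objects makes it easy to derive either the full three-way counts or the user-aggregated resource-tag counts from the same list of triples.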
Figure 1: Graphical representations of the probabilistic Latent
Semantic Model (left) and Multi-way Aspect Model
(right). R, U, T and Z denote the variables "Resource", "User",
"Tag" and "Topic" respectively. $N_t$ represents the number of
tag occurrences for a particular resource; D represents the
number of resources. Meanwhile, $N_b$ represents the number of
all resource-user-tag co-occurrences in the social annotation
system. Note that filled circles represent observed variables.



We collect these triples by crawling del.icio.us. The system
provides three types of pages: a tag page, listing all
resources that are tagged with a particular keyword; a user
page, listing all resources that have been bookmarked by
a particular user; and a resource page, listing all the tags
that users have associated with that resource. del.icio.us also
provides a method for navigating back and forth between
these pages, allowing us to crawl the site. Given the seed,
we get what del.icio.us shows as the most popular tags as- 
signed by the users to it. Next we collect other resources 
annotated with these tags. For each of these we collect the 
resource-user-tag triples. We use these data to discover re- 
sources with the same functionality as the seed, as described 
below. 

Approach 

We use probabilistic models in order to find a compressed 
description of the collected resources in terms of topic de- 
scriptions. This description is a vector of probabilities of 
how a particular resource is likely to be described by dif- 
ferent topics. The topic distribution of the resource is sub- 
sequently used to compute similarity between resources
using Jensen-Shannon divergence (Lin 1991). For the rest of
this section, we describe the probabilistic models. We first 
briefly describe two existing models: the probabilistic La- 
tent Semantic Analysis (pLSA) model and the Three-Way 
Aspect model (MWA). We then introduce a new model that 
explicitly takes into account users' interests and resources' 
topics. We compare performance of these models on the 
three del.icio.us datasets. 

Probabilistic Latent Semantic Model (pLSA) 



Hofmann (Hofmann 1999) proposed a probabilistic latent
semantic model for associating word-document
co-occurrences. The model hypothesized that a particular document
is composed of a set of conceptual themes or topics Z.
Words in a document were generated by these topics with
some probability. We adapted the model to the context of
social annotation by assuming that all users agree on how
to annotate a particular resource. All bookmarks
from all users associated with a given resource were
aggregated into a single corpus. Figure 1 shows the graphical
representation of this model. Co-occurrences of a particular
resource-tag pair were computed by summing resource-user-tag
triples (r, u, t) over all users. The joint distribution
over resource and tag is



$$p(r,t) = \sum_{z} p(t \mid z)\, p(z \mid r)\, p(r) \qquad (1)$$



In order to estimate the parameters $p(t \mid z)$, $p(z \mid r)$ and $p(r)$, we
define the log likelihood L, which measures how well the estimated
parameters fit the observed data:

$$L = \sum_{r,t} n(r,t) \log p(r,t) \qquad (2)$$

where $n(r,t)$ is the number of resource-tag co-occurrences.
The EM algorithm (Dempster, Laird, & Rubin 1977) was
applied to estimate the parameters that maximize L.
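
As an illustration of how EM can be run for this model, the following is a minimal NumPy sketch, not the implementation used in our experiments. It assumes the counts n(r, t) fit in a dense matrix, uses random initialization, and runs a fixed number of iterations; the function names `plsa` and `log_likelihood` and the small constant `eps` are choices made for this sketch.

```python
import numpy as np


def plsa(n_rt, num_topics, num_iters=50, seed=0, eps=1e-12):
    """Fit p(t|z) and p(z|r) to a dense count matrix n_rt of shape (R, T)."""
    rng = np.random.default_rng(seed)
    R, T = n_rt.shape
    p_t_z = rng.random((num_topics, T))
    p_t_z /= p_t_z.sum(axis=1, keepdims=True)
    p_z_r = rng.random((R, num_topics))
    p_z_r /= p_z_r.sum(axis=1, keepdims=True)
    for _ in range(num_iters):
        # E-step: q[r, t, z] = p(z | r, t), proportional to p(t|z) p(z|r).
        q = p_z_r[:, None, :] * p_t_z.T[None, :, :]
        q /= q.sum(axis=2, keepdims=True) + eps
        weighted = n_rt[:, :, None] * q
        # M-step: renormalize the expected counts.
        p_t_z = weighted.sum(axis=0).T
        p_t_z /= p_t_z.sum(axis=1, keepdims=True) + eps
        p_z_r = weighted.sum(axis=1)
        p_z_r /= p_z_r.sum(axis=1, keepdims=True) + eps
    return p_t_z, p_z_r


def log_likelihood(n_rt, p_t_z, p_z_r):
    """Log likelihood of Eq. 2, with p(r) taken from the observed counts."""
    p_r = n_rt.sum(axis=1) / n_rt.sum()
    p_rt = (p_z_r @ p_t_z) * p_r[:, None]
    mask = n_rt > 0
    return float(np.sum(n_rt[mask] * np.log(p_rt[mask])))
```

The dense (R, T, K) array in the E-step is only practical for small examples; a dataset with millions of co-occurrences would need a sparse representation that iterates over observed pairs only.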

Three-way Aspect Model (MWA) 

The three-way aspect model (or multi-way aspect model, 
MWA) was originally applied to document recommenda- 
tion systems (Popescul et al. 2001), involving three entities:
user, document and word. The model takes into account
both user interest (pure collaborative filtering) and document
content (content-based). Recently, the three-way aspect
model was applied to social annotation data in order
to demonstrate emergent semantics in a social annotation
system and to use these semantics for information retrieval
(Wu, Zhang, & Yu 2006). In this model, a conceptual
space was introduced as a latent variable, Z, which independently
generated occurrences of resources, users and tags for
a particular triple (r, u, t) (see Figure 1). The joint distribution
over resource, user, and tag was defined as follows:



$$p(r,u,t) = \sum_{z} p(r \mid z)\, p(u \mid z)\, p(t \mid z)\, p(z) \qquad (3)$$



Similar to pLSA, the parameters $p(r \mid z)$, $p(u \mid z)$, $p(t \mid z)$
and $p(z)$ were estimated by maximizing the log likelihood
objective function, $L = \sum_{r,u,t} n(r,u,t) \log p(r,u,t)$. The EM
algorithm was again applied to estimate these parameters.
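
For reference, the EM updates implied by the joint distribution in Equation 3 can be written out as below. These are the standard updates for a mixture of this form, derived here for illustration rather than quoted from the cited work; the updates for $p(u \mid z)$ and $p(t \mid z)$ follow the same pattern as the one shown for $p(r \mid z)$.

```latex
\begin{align*}
\text{E-step:}\quad
p(z \mid r,u,t) &=
  \frac{p(r \mid z)\, p(u \mid z)\, p(t \mid z)\, p(z)}
       {\sum_{z'} p(r \mid z')\, p(u \mid z')\, p(t \mid z')\, p(z')} \\[4pt]
\text{M-step:}\quad
p(r \mid z) &=
  \frac{\sum_{u,t} n(r,u,t)\, p(z \mid r,u,t)}
       {\sum_{r',u,t} n(r',u,t)\, p(z \mid r',u,t)} \\[4pt]
p(z) &=
  \frac{\sum_{r,u,t} n(r,u,t)\, p(z \mid r,u,t)}
       {\sum_{r,u,t} n(r,u,t)}
\end{align*}
```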

Interest-Topic Model (ITM) 

The motivation to implement the model proposed in this pa- 
per comes from the observation that users in a social anno- 
tation system have very broad interests. A set of tags in a 
particular bookmark could reflect both users' interests and 
resources' topics. As in the three-way aspect model, using a 
single latent variable to represent both "interests" and "top- 
ics" may not be appropriate, as intermixing between these 
two may skew the final similarity scores computed from the 
topic distribution over resources. 




Figure 2: Graphical representation of the proposed model.
R, U, T, I and Z denote the variables "Resource", "User",
"Tag", "Interest" and "Topic" respectively. $N_t$ represents the
number of tag occurrences for one bookmark (by a particular
user to a particular resource); D represents the number of
all bookmarks in the social annotation system.

Instead, we propose to explicitly separate the latent vari- 
ables into two: one representing user interests, I; another
representing resource topics, Z. According to the proposed 
model, the process of resource-user-tag co-occurrence could 
be described as a stochastic process: 

• User u finds a resource r interesting and she would like to 
bookmark it 

• User u has her own interest profile i; meanwhile the re- 
source has a set of topics z. 

• Tag t is then chosen based on the user's interest and the
resource's topic

The process is depicted in graphical form in Figure 2.
From the process described above, the joint probability of
resource, user and tag is written as

$$p(r,u,t) = \sum_{i,z} p(t \mid i,z)\, p(i \mid u)\, p(z \mid r)\, p(u)\, p(r) \qquad (4)$$

Log likelihood L is used as the objective function to estimate
all parameters. Note that $p(u)$ and $p(r)$ could be
obtained directly from observed data; the estimation thus
involves three parameters: $p(t \mid i,z)$, $p(i \mid u)$ and $p(z \mid r)$. L is
defined as

$$L = \sum_{r,u,t} n(r,u,t) \log p(r,u,t) \qquad (5)$$



The EM algorithm is applied to estimate these parameters. In
the expectation step, the joint probability of the hidden variables
I and Z given all observations is computed as

$$p(i,z \mid u,r,t) = \frac{p(t \mid i,z)\, p(i \mid u)\, p(z \mid r)}{\sum_{i',z'} p(t \mid i',z')\, p(i' \mid u)\, p(z' \mid r)} \qquad (6)$$

Subsequently, each parameter is re-estimated using the
$p(i,z \mid u,r,t)$ just computed in the E step:

$$p(t \mid i,z) = \frac{\sum_{r,u} n(r,u,t)\, p(i,z \mid u,r,t)}{\sum_{r,u,t'} n(r,u,t')\, p(i,z \mid u,r,t')} \qquad (7)$$

$$p(i \mid u) = \frac{\sum_{r,t} n(r,u,t) \sum_{z} p(i,z \mid u,r,t)}{n(u)} \qquad (8)$$

$$p(z \mid r) = \frac{\sum_{u,t} n(r,u,t) \sum_{i} p(i,z \mid u,r,t)}{n(r)} \qquad (9)$$

The algorithm iterates between the E and M steps until the log
likelihood or all parameter values converge.
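
The E and M steps above can be sketched in NumPy as follows. This is an illustrative sketch under assumptions, not our actual implementation: the function name `fit_itm` is hypothetical, resources, users and tags are assumed to be encoded as contiguous integer indices that each appear at least once, n(u) and n(r) are taken to be the total counts for user u and resource r, and a fixed number of iterations replaces a convergence test.

```python
import numpy as np


def fit_itm(r_idx, u_idx, t_idx, counts, num_interests, num_topics,
            num_iters=30, seed=0):
    """EM for p(t|i,z), p(i|u), p(z|r) from M observed (r, u, t) triples.

    r_idx, u_idx, t_idx are integer arrays of length M; counts holds n(r,u,t).
    """
    rng = np.random.default_rng(seed)
    R, U, T = r_idx.max() + 1, u_idx.max() + 1, t_idx.max() + 1

    def normalized(shape, axis):
        x = rng.random(shape)
        return x / x.sum(axis=axis, keepdims=True)

    p_t_iz = normalized((num_interests, num_topics, T), axis=2)
    p_i_u = normalized((U, num_interests), axis=1)
    p_z_r = normalized((R, num_topics), axis=1)
    history = []  # log likelihood up to the constant log p(u) + log p(r) terms
    for _ in range(num_iters):
        # E-step (Eq. 6): joint[m, i, z] = p(t|i,z) p(i|u) p(z|r) for triple m.
        joint = (np.moveaxis(p_t_iz[:, :, t_idx], 2, 0)
                 * p_i_u[u_idx][:, :, None]
                 * p_z_r[r_idx][:, None, :])
        evidence = joint.sum(axis=(1, 2))
        history.append(float(np.sum(counts * np.log(evidence))))
        q = joint / evidence[:, None, None]
        w = counts[:, None, None] * q  # expected counts per triple
        # M-step (Eq. 7): accumulate by tag, normalize over tags.
        acc_t = np.zeros((T, num_interests, num_topics))
        np.add.at(acc_t, t_idx, w)
        p_t_iz = np.moveaxis(acc_t / acc_t.sum(axis=0, keepdims=True), 0, 2)
        # M-step (Eq. 8): accumulate by user; each row sums to n(u).
        acc_u = np.zeros((U, num_interests))
        np.add.at(acc_u, u_idx, w.sum(axis=2))
        p_i_u = acc_u / acc_u.sum(axis=1, keepdims=True)
        # M-step (Eq. 9): accumulate by resource; each row sums to n(r).
        acc_r = np.zeros((R, num_topics))
        np.add.at(acc_r, r_idx, w.sum(axis=1))
        p_z_r = acc_r / acc_r.sum(axis=1, keepdims=True)
    return p_t_iz, p_i_u, p_z_r, history
```

The posterior is stored only for observed triples, as an (M, I, K) array. For datasets with millions of triples this array would likely have to be processed in batches.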

Once all the models are learned, we use the distribution 
of topics of a resource p(z\r) to compute similarity between 
resources and the seed using Jensen-Shannon divergence. 
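
The ranking step can be sketched as follows. The sketch assumes base-2 logarithms, so that the divergence lies between 0 and 1, and a matrix `p_z_r` whose rows are the topic distributions of the resources; both the function names and the choice of logarithm base are assumptions of this example.

```python
import numpy as np


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_by_similarity(p_z_r, seed_index):
    """Return resource indices ordered from most to least similar to the seed."""
    distances = np.array(
        [jensen_shannon(p_z_r[seed_index], row) for row in p_z_r]
    )
    order = np.argsort(distances, kind="stable")
    return [int(i) for i in order if i != seed_index]
```

A smaller divergence means a more similar topic distribution, so the first entries of the returned list are the candidates to examine first.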

Empirical Validation 

To evaluate our approach, we collected data for three seed
resources: flytecomm², geocoder³ and wunderground⁴. The
first resource allows users to track flights given the airline
and flight number or departure and arrival airports; the second
resource returns coordinates of a given address; while
the third resource supplies weather information for a partic-
ular location (given by zipcode, city and state, or airport). 
Our goal is to find other resources that provide flight track- 
ing, geocoding and weather information. Our approach is 
to crawl del.icio.us to gather resources possibly related to 
the seed; apply the probabilistic models to find the topic dis- 
tribution of the resources; then rank all gathered resources 
based on how similar their topic distribution is to the seed's 
topic distribution. The crawling strategy is defined as fol- 
lows: for each seed 

• Retrieve the 20 most popular tags that users have applied 
to that resource 

• For each of the tags, retrieve other resources that have 
been annotated with that tag 

• For each resource, collect all bookmarks that have been 
created for it (i.e., resource-user-tag triples) 

We wrote special-purpose Web page scrapers to extract this 
information from del.icio.us. In principle, we could continue 
to expand the collection of resources by gathering tags and 
retrieving more resources that have been tagged with those 
tags, but in practice, even after the small traversal we do, we 
obtain more than 10 million triples for the wunderground
seed. 
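
The crawling strategy above can be sketched as a small function. The three callables passed in are hypothetical stand-ins for the page scrapers; their names and return shapes are assumptions of this sketch, not an actual del.icio.us API.

```python
def collect_triples(seed, top_tags, resources_for_tag, bookmarks_for,
                    num_tags=20):
    """Gather (resource, user, tag) triples reachable from a seed resource.

    top_tags(resource)      -> list of tags, most popular first
    resources_for_tag(tag)  -> iterable of resources annotated with the tag
    bookmarks_for(resource) -> iterable of (user, tags) bookmarks
    """
    triples = []
    seen = set()
    for tag in top_tags(seed)[:num_tags]:
        for resource in resources_for_tag(tag):
            if resource in seen:
                continue  # a resource reached via several tags is crawled once
            seen.add(resource)
            for user, tags in bookmarks_for(resource):
                triples.extend((resource, user, t) for t in tags)
    return triples
```

Passing the scrapers in as arguments keeps the traversal logic testable with in-memory dictionaries, independent of how the pages are actually fetched.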

We obtained the datasets for the seeds flytecomm and 
geocoder in May 2006 and for the seed wunderground in 
January 2007. We reduced the dataset by omitting low 
(fewer than ten) and high (more than ten thousand) fre- 
quency tags and all the triples associated with those tags. 
After this reduction, we were left with (a) 2,284,308 triples
with 3,562 unique resources, 14,297 unique tags and 34,594
unique users for the flytecomm seed; (b) 3,775,832 triples
with 5,572 unique resources, 16,887 unique tags and 46,764
unique users for the geocoder seed; and (c) 6,327,211 triples
with 7,176 unique resources, 77,056 unique tags and 45,852
unique users for the wunderground seed.
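
The tag-frequency reduction can be sketched as below. The sketch assumes that a tag's frequency is the number of triples in which it occurs, which is one plausible reading of the thresholds given above; the function name is hypothetical.

```python
from collections import Counter


def filter_by_tag_frequency(triples, min_count=10, max_count=10_000):
    """Drop triples whose tag occurs fewer than min_count or more than
    max_count times across the whole collection."""
    freq = Counter(tag for _, _, tag in triples)
    return [x for x in triples if min_count <= freq[x[2]] <= max_count]
```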

Next, we trained all three models on the data: pLSA, 
MWA and ITM. We then used the learned topic distributions 
to compute the similarity of the resources in each dataset to 
the seed, and ranked the resources by similarity. We evalu- 
ated the performance of each model by manually checking 
the top 100 resources produced by the model according to 
the criteria below: 

• same: the resource has the same functionality if it pro- 
vides an input form that takes the same type of data as the 
seed and returns the same type of output data: e.g., a flight 
tracker takes a flight number and returns flight status 

• link-to: the resource contains a link to a page with the
same functionality as the seed (see the criterion above). We
can easily automate the step that checks the links for the
right functionality.

Although evaluation is performed manually now,
we plan to automate this process in the future by
using the form's metadata to predict semantic types
of inputs (Heß & Kushmerick 2003), automatically
querying the source, extracting data from it and classifying it
using the tools described in (Gazen & Minton 2005;
Lerman, Plangprasopchok, & Knoblock 2006). We will
then be able to validate that the resource has functionality
similar to the seed by comparing its input and output data
with that of the seed (Carman & Knoblock 2007). Note that
since each step in the automatic query and data extraction
process has some probability of failure, we will need to
identify many more relevant resources than required in
order to guarantee that we will be able to automatically
verify some of them.

² http://www.flytecomm.com/cgi-bin/trackflight/
³ http://geocoder.us
⁴ http://www.wunderground.com/

Figure 3 shows the performance of different models
trained with either 40 or 100 topics (and interests) on the 
three datasets. The figure shows the number of resources 
within the top 100 that had the same functionality as the 
seed or contained a link to a resource with the same func- 
tionality. The Interest-Topic model performed slightly bet- 
ter than pLSA, while both ITM and pLSA significantly out- 
performed the MWA model. Increasing the dimensionality 
of the latent variable Z from 40 to 100 generally improved 
the results, although sometimes only slightly. Google's find 
"Similar pages" functionality returned 28, 29 and 15 re- 
sources respectively for the three seeds flytecomm, geocoder 
and wunderground, out of which 5, 6, and 13 had the same 
functionality as the seed and 3, 0, had a link to a resource 
with the same functionality. The ITM model, in comparison, 
returned three to five times as many relevant results. 

Table 1 provides another view of the performance of differ-
ent resource discovery methods. It shows how many of the 
method's predictions have to be examined before ten re- 
sources with correct functionality are identified. Since the 
ITM model ranks the relevant resources highest, fewer Web 
sites have to be examined and verified (either manually or 
automatically); thus, ITM is the most efficient model. 
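
The measure described above, the number of ranked predictions that must be examined before ten relevant resources are found, can be computed with a few lines. The function name and the boolean-list input are assumptions of this sketch.

```python
def examined_until_k_relevant(ranked_relevance, k=10):
    """Return how many top-ranked items must be examined to find k relevant
    ones, or None if the ranked list contains fewer than k relevant items."""
    found = 0
    for position, relevant in enumerate(ranked_relevance, start=1):
        found += bool(relevant)
        if found == k:
            return position
    return None
```

A return value of `None` corresponds to the case where a method returns fewer than ten relevant resources in total, so the count can only be bounded from below by the length of its result list.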

Figure 3: Performance of different models on the three datasets. Each model was trained with 40 or 100 topics. For ITM, we
fixed the number of interests at 20 across all datasets. The bars show the number of resources within the top 100 returned by
each model that had the same functionality as the seed or contained a link to a resource with the same functionality as the seed.

One possible reason why ITM performs only slightly better
than pLSA is that, in the datasets we collected,
there is low variance of user interest. The resources were
gathered starting from a seed and following related tag links;
therefore, we did not obtain any resources that were annotated
with different tags than the seed, even if they were tagged
by the same user who bookmarked the seed. Hence
user-resource co-occurrences are incomplete: they are limited by
a certain tag set. pLSA and ITM would perform similarly if
all users had the same interests. We believe that ITM would
perform significantly better than pLSA when variation of
user interest is high. We plan to gather more complete data
to evaluate ITM's behavior in more detail.

Although the performance of pLSA and ITM differs only
slightly, pLSA is much better than ITM in terms of efficiency,
since the former ignores user information and thus
reduces the iterations required in its training process. However,
for some applications, such as personalized resource discov- 
ery, it may be important to retain user information. For such 
applications the ITM model, which retains this information, 
may be preferred over pLSA. 

Previous Research 

Popular methods for finding documents relevant to a user 
query rely on analysis of word occurrences (including meta- 
data) in the document and across the document collection. 
Information sources that generate their contents dynamically 
in response to a query cannot be adequately indexed by con- 
ventional search engines. Since they have sparse metadata,
the user has to find the correct search terms in order to get
results.

                pLSA    MWA    ITM    GOOGLE*
flytecomm         23     65     15      > 28
geocoder          14     44     16      > 29
wunderground      10     14     10        10

Table 1: The number of top predictions that have to be examined
before the system finds ten resources with the desired
functionality (the same or link-to). Each model was trained
with 100 topics. For ITM, we fixed the number of interests
at 20. *Note that Google returns only 8 and 6 positive resources
out of 28 and 29 retrieved resources for the flytecomm
and geocoder datasets respectively.



Recent work (Dong et al. 2004) proposed to utilize
metadata in Web services' WSDL and UDDI files in order
to find Web services offering similar operations in an
unsupervised fashion. The work is based on the heuristic
that similar operations tend to be described by similar
terms in the service description, operation name and input and
output names. The method clusters Web service operations
using cohesion and correlation scores (distances) computed from
co-occurrences of observed terms. In this approach, a given
operation can only belong to a single cluster. In contrast, our
approach is grounded in
a probabilistic topic model, allowing a particular resource to
be generated by several topics, which is more intuitive and
robust. In addition, it yields a method to determine how
similar a resource is to others in certain aspects.

Although our objective is similar, instead of words or 
metadata created by the authors of online resources, our ap- 
proach utilizes the much denser descriptive metadata gen- 
erated in a social bookmarking system by the readers or 
users of these resources. One issue to be considered is 
the metadata cannot be directly used for categorizing re- 
sources since they come from different user views, interests 
and writing styles. One needs algorithms to detect patterns 
in these data, find hidden topics which, when known, will 
help to correctly group similar resources together. We apply 
and extend the probabilistic topic model, in particular pLS A 
dHofmann 19991 1 to address such issue. 

Our model is conceptually motivated by the Author-Topic
model (Rosen-Zvi et al. 2004), where we can view a user
who annotates a resource as an author who composes a document.
The aim in that approach is to learn the topic distribution
for a particular author, while our goal is to learn the topic
distribution for a certain resource. Gibbs sampling was used
for parameter estimation in that model; we instead use
the generic EM algorithm to estimate parameters, since it is
analytically straightforward and easy to implement.

The most relevant work (Wu, Zhang, & Yu 2006) applies
the multi-way aspect model to social annotation data from
del.icio.us. The model doesn't explicitly separate user interests
and resource topics as our model does. Moreover,
the work focuses on emergent semantics and personalized
resource search, and is evaluated by demonstrating that
it can alleviate the problems of tag sparseness and synonymy in
the task of searching for resources by a tag. In our work, on
the other hand, the model is applied to search for resources
similar to a given resource.

There is another line of research on resource discovery
that exploits social network information of the Web
graph. Google (Brin & Page 1998) uses the visitation rate obtained
from resources' connectivity to measure their popularity.
HITS (Kleinberg 1999) also uses the Web graph to rate relevant
resources by measuring their authority and hub values.
Meanwhile, ARC (Chakrabarti et al. 1998) extends HITS
by including content information of resource hyperlinks to
improve system performance. Although the objective is
somewhat similar, our work instead exploits resource metadata
generated by the community to compute resources' relevance
scores.

Conclusion 

We have presented a probabilistic model of the social
annotation process and described an approach to utilize the
model in the resource discovery task. Although we cannot
compare its performance to that of state-of-the-art search
engines directly, the experimental results show the method to
be promising.

There remain many issues to pursue. First, we would like 
to study the output of the models, in particular, what the user 
interests tell us. We would also like to automate the source 
modeling process by identifying the resource's HTML form 
and extracting its metadata. We will then use techniques described
in (Heß & Kushmerick 2003) to predict the semantic
types of the resource's input parameters. This will enable
us to automatically query the resource and classify the returned
data using tools described in (Gazen & Minton 2005;
Lerman, Plangprasopchok, & Knoblock 2006). We will
then be able to validate that the resource has the same func- 
tionality as the seed by comparing its input and output data 
with that of the seed (Carman & Knoblock 2007 ). This will 
allow agents to fully exploit our system for integrating in- 
formation across different resources without human inter- 
vention. 

Our next goal is to generalize the resource discovery 
process so that instead of starting with a seed, a user can 
start with a query or some description of the information 
need. We will investigate different methods for translating 
the query into tags that can be used to harvest data from 
del.icio.us. In addition, there is other evidence potentially 
useful for resource categorization such as user comments, 
content and input fields in the resource. We plan to extend 
the present work to unify evidence both from annotation and 
resources' content to improve the accuracy of resource dis- 
covery. 